Abstract. Discovering a concise schema from given XML documents
is an important problem in XML applications. In this paper, we focus
on the problem of learning an unordered schema from a given set of
XML examples, which is in fact the problem of learning a restricted
regular expression with interleaving from positive example strings.
Schemas with interleaving can present meaningful knowledge that cannot
be disclosed by previous inference techniques. Moreover, inference of
the minimal schema with interleaving is challenging: the problem of
finding a minimal schema with interleaving is shown to be NP-hard.
Therefore, we develop an approximation algorithm and a heuristic
solution to tackle the problem using techniques different from known
inference algorithms. We conduct experiments on real-world data sets to
demonstrate the effectiveness of our approaches. Our heuristic algorithm
is shown to produce results that are very close to optimal.
1 Introduction

When XML is used for data-centric applications such as integration, there may
be no order constraint among siblings [1]. Meanwhile, the relative order within
siblings may still be important. For example, consider a ticket system with two
ticket machines, where two groups of tourists are lining up to buy tickets.
Each group has two tourists. We can then define the unordered schema for the
ticket system. The ordered groups preserve only the relative order of their
members. This not only allows individual tourists to insert themselves within
a group, but also lets two groups interleave their members. The exact XML
Schema Definition (XSD) for the purchasing sequence can be essentially
represented as

g1.m1* g1.m2* g2.m1* g2.m2* | g2.m1* g2.m2* g1.m1* g1.m2* |
g1.m1* g2.m1* g1.m2* g2.m2* | g1.m1* g2.m1* g2.m2* g1.m2* |
g2.m1* g1.m1* g2.m2* g1.m2* | g2.m1* g1.m1* g1.m2* g2.m2*,

where gi.mj* means that the jth member in the ith group can buy zero or more
tickets. This shows that the length of the exact regular expression can be
exponential in the number of members in the sequences.

Actually, (g1.m1|g1.m2|g2.m1|g2.m2)* is used in practice [3] instead of the
minimal expression, which may permit invalid XML documents (i.e., it is
over-permissive). For example, it permits the second member of the first
group to purchase tickets before the first member. Over-permissiveness has
many negative consequences [3]. Thus it is necessary to study how to infer
an unordered minimal schema for this kind of XML documents.
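
The growth of the exact expression in the ticket example can be checked with a small sketch. The function name `interleavings` and the tuple encoding of the two groups are illustrative assumptions, not part of the system described in this paper; the sketch simply enumerates every merge of the groups that keeps each group's internal order.

```python
def interleavings(groups):
    """Yield every merge of the sequences in `groups` (a tuple of tuples)
    that preserves the internal order of each sequence."""
    if all(len(g) == 0 for g in groups):
        yield ()
        return
    for k, g in enumerate(groups):
        if g:
            # Take the head of group k, then merge whatever is left.
            rest = groups[:k] + (g[1:],) + groups[k + 1:]
            for tail in interleavings(rest):
                yield (g[0],) + tail


# Two groups with two members each, as in the ticket example.
two_groups = (("g1.m1", "g1.m2"), ("g2.m1", "g2.m2"))
alternatives = list(interleavings(two_groups))
print(len(alternatives))  # number of alternatives in the exact expression
```

With two groups of two members the enumeration yields the six alternatives listed above; adding a third group of two members already gives 90 alternatives, which illustrates why an interleaving operator is preferable to spelling the alternatives out.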

Previous research on XML schema inference has been done mainly in the
context of ordered XML, where the problem can be reduced to learning regular
expressions. Gold [9] showed that the class of regular expressions is not
identifiable in the limit. Therefore numerous papers (e.g. [2,5,6,12]) studied
inference algorithms for restricted classes of regular expressions. Most of
them were based on properties of automata. Bex et al. [2] proposed learning
algorithms for single occurrence regular expressions (SOREs) and chain regular
expressions (CHAREs). Freydenberger and Kötzing [12] gave more efficient
algorithms that learn a minimal generalization for the above classes. Their
approach is based on descriptive generalization [12], which is a natural
extension of Gold-style learning.

However, there is no such kind of automaton for regular expressions with
interleaving, since they do not preserve the total order among symbols. Thus we
have to explore new techniques. Ciucanu [13] proposed learning algorithms
for two unordered schema formalisms: disjunctive multiplicity schemas (DMS)
and their restriction, disjunction-free multiplicity schemas (MS). Both of them
disallow concatenation within siblings and are thus less expressive than ours.
Moreover, the ordering information in our schema formalism cannot be fully
captured by the three characterizing triples used to construct a DMS or MS.

The inference algorithms in this paper use techniques similar to those of
algorithms that mine global partial orders from sequence data [14,15,17].
However, the semantic concepts there are typically quite different from ours.
Mannila et al. [15] tried to find mixture models of parallel partial orders.
However, to learn unordered regular expressions, series-parallel orders may
not be sufficient, since they can conflict with some data in the whole data
set. Another restriction of the above method is that it can only be applied
to strings in which each symbol occurs at most once. Gionis et al. [14], in
particular, focused on recovering the underlying ordering of the attributes
in high-dimensional collections of 0-1 data. An implicit assumption is that
each attribute can also occur at most once. For learning regular expressions
with interleaving, symbols in strings can occur any number of times, and the
partial orders among siblings are independent with no violations. Hence many
techniques from data mining are not directly applicable. Therefore, learning
restricted regular expressions with interleaving remains a challenging
problem.

In this paper, we address the problem of discovering a minimal regular
expression with interleaving from positive examples. The main contributions
of the paper are as follows:

- We propose a more suitable formalism to specify precise unordered XML:
the subset of regular expressions with interleaving (SIREs). SIREs can
express content models succinctly. For example, the ticket example above
can be written as (g1.m1* g1.m2*) & (g2.m1* g2.m2*).

- We introduce the notion of SIRE-minimal in the terminology of [12] and some 
properties of SIRE-minimal. 

- We prove the problem of finding a minimal SIRE is NP-hard and develop an 
approximation algorithm conMiner to find solutions with worst-case quality 
guarantees and a heuristic algorithm conDAG that mostly finds solutions of 
better quality as compared to the approximation algorithm conMiner. 

- We conduct experiments comparing our methods with Trang [8] on real world 
data, incorporating small and large data sets. Our experiments show that 
conMiner and conDAG outperform existing systems on such data. 

The rest of the paper is organized as follows. Section 2 contains basic definitions.
In Section 3 we discuss properties of minimal SIREs. In Section 4 an
approximation algorithm conMiner and a heuristic algorithm conDAG are proposed.
Section 5 gives the empirical results. Conclusions are drawn in Section 6.

2 Preliminaries 

Let $u$ and $v$ be two arbitrary strings. By $u \& v$ we denote the set of
strings obtained by interleaving $u$ and $v$ in every possible way. That is,
$u \& \varepsilon = \varepsilon \& u = u$ and $v \& \varepsilon = \varepsilon \& v = v$.
If both $u$ and $v$ are non-empty, let $u = au'$ and $v = bv'$, where $a$ and
$b$ are single symbols; then $u \& v = a(u' \& v) \cup b(u \& v')$. Let $\Sigma$
be an alphabet of symbols. The regular expressions with interleaving over
$\Sigma$ are defined as follows: $\emptyset$, $\varepsilon$ and every
$a \in \Sigma$ are regular expressions, and $E_1^*$, $E_1^?$, $E_1^+$,
$E_1E_2$, $E_1|E_2$ and $E_1 \& E_2$ are regular expressions for regular
expressions $E_1$ and $E_2$. They are denoted as RE(&). The language
described by $E$ is defined as follows: $L(\emptyset) = \emptyset$;
$L(\varepsilon) = \{\varepsilon\}$; $L(a) = \{a\}$; $L(E_1^*) = L(E_1)^*$;
$L(E_1^+) = L(E_1)^+$; $L(E_1^?) = L(E_1) \cup \{\varepsilon\}$;
$L(E_1E_2) = L(E_1)L(E_2)$; $L(E_1|E_2) = L(E_1) \cup L(E_2)$;
$L(E_1 \& E_2) = L(E_1) \& L(E_2)$. We consider the subset of regular
expressions with interleaving (SIREs) defined by the following grammar.

Definition 1. The restricted class of regular expressions with interleaving (RREs)
are the RE(&) over $\Sigma$ defined by the following grammar for any $a \in \Sigma$:

S ::= T&S | T

T ::= ε | a | a+ | a? | a* | TT

The subset of regular expressions with interleaving (SIREs) are those RREs in
which every symbol occurs at most once. Since SIREs disallow repetitions of
symbols, they are deterministic and satisfy the UPA constraint required
by the XML specification.

A partial order M for a string s is a binary relation that is reflexive,
antisymmetric and transitive. We write $a \prec b$ if a is before b in the
partial order. For a string $s = x_1 \cdots x_l$, the transitive closure of s is
denoted by $tr(s) = \{(x_i, x_j) \mid 1 \le i < j \le l\}$, where $l$ is the
length of s. For example, for s = abcd,
tr(s) = {ab, ac, ad, bc, bd, cd}.
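
As a minimal sketch of this definition, the transitive closure of a string can be computed with `itertools.combinations`, which yields exactly the pairs (x_i, x_j) with i < j. The function name `tr` follows the notation above; dropping pairs of identical symbols (such as aa for a string with repeated symbols) is an assumption made here to match how repeated symbols are treated in the examples of Section 4.

```python
from itertools import combinations


def tr(s):
    """Pairs x_i x_j with i < j, written as two-character strings.

    Assumption: pairs of equal symbols (e.g. 'aa') are dropped,
    since they carry no ordering information.
    """
    return {a + b for a, b in combinations(s, 2) if a != b}


print(sorted(tr("abcd")))
```
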

A partial-order set t is a set of symbols together with a partial ordering. We
say ab ∈ t if a precedes b in every string of a string collection. A consistent
partial order set (CPOS) T is a set that contains all the disjoint partial-order
sets t_i of the given examples. For example, consider W = {abcd, dabc}.
Obviously, a ≺ b ≺ c, so T = {abc, d}. The connection between CPOS and SIRE
is direct: given a CPOS, we can write it in the form of a SIRE by combining all
the elements of the CPOS with &. For example, in this case the corresponding
SIRE is s = abc&d. Therefore, the problem of finding a minimal SIRE can be
reduced to the problem of finding a minimal CPOS.
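
The correspondence between a CPOS and a SIRE can be sketched as follows. The helper names `cpos_to_sire` and `respects` are hypothetical, counting operators are ignored, and a CPOS is assumed to be given as a list of strings, each listing the symbols of one partial-order set in order.

```python
def cpos_to_sire(cpos):
    """Combine the elements of a CPOS with the interleaving operator &."""
    return "&".join(cpos)


def respects(cpos, word):
    """True if, within every element of the CPOS, the symbols of `word`
    appear in the order prescribed by that element."""
    for component in cpos:
        projected = [ch for ch in word if ch in component]
        # sorted() is stable, so repeated symbols keep their positions.
        if projected != sorted(projected, key=component.index):
            return False
    return True


T = ["abc", "d"]
print(cpos_to_sire(T))
print(all(respects(T, w) for w in ["abcd", "dabc"]))
```

For W = {abcd, dabc} both words respect T = {abc, d}, whereas a word such as acbd would not, because it reverses b and c.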

3 Descriptivity 

This section introduces the notion of minimal expressions. Roughly speaking,
a minimal expression is the greatest lower bound of a language L within a class
of expressions, which is conceptually similar to the infimum in mathematics.

Definition 2 ([12]). Let $\mathcal{D}$ be a class of regular expressions over
some alphabet $\Sigma$. A $\delta \in \mathcal{D}$ is called
$\mathcal{D}$-minimal for a non-empty language $S \subseteq \Sigma^*$ if
$L(\delta) \supseteq S$, and there is no $\gamma \in \mathcal{D}$ such that
$L(\delta) \supset L(\gamma) \supseteq S$.

Proposition 1. Let n be the number of alphabet symbols. The number of
pairwise non-equivalent SIREs is $O(n!)$.

Proof. Disregarding the operators ?, +, *, the number of SIREs over a finite
$\Sigma$ is equivalent to the number of ordered partitions of $|\Sigma|$
symbols. The number of these partitions is given by the $|\Sigma|$-th ordered
Bell number [11]. For instance, if $\Sigma$ = {a, b, c}, the 3rd ordered Bell
number is a(3) = 13, and the ordered partitions of {a, b, c} are
{abc, acb, bac, bca, cab, cba, ab&c, ba&c, ac&b, ca&b, bc&a, cb&a, a&b&c}.
They are also the distinct SIREs over $\Sigma$. The ordered Bell number [10]
can be approximated as $a(n) \approx \frac{n!}{2(\ln 2)^{n+1}}$. Since every
symbol a in $\Sigma$ has four forms, which can be represented as a, a?, a+ and
a*, the number of SIREs over $\Sigma$ is $4^n a(n)$. Then
$s(n) \approx \frac{4^n\, n!}{2(\ln 2)^{n+1}}$. □
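
The quantities used in this proof can be checked numerically. The sketch below computes ordered Bell numbers with the standard recurrence a(0) = 1, a(m) = Σ_{i=1..m} C(m, i) a(m − i), compares them with the approximation n!/(2 (ln 2)^{n+1}), and evaluates the bound 4^n a(n). The function names are illustrative assumptions.

```python
from math import comb, factorial, log


def ordered_bell(n):
    """n-th ordered Bell number via a(m) = sum_{i=1..m} C(m, i) * a(m - i)."""
    a = [1]
    for m in range(1, n + 1):
        a.append(sum(comb(m, i) * a[m - i] for i in range(1, m + 1)))
    return a[n]


def ordered_bell_approx(n):
    """Asymptotic approximation n! / (2 (ln 2)^(n+1))."""
    return factorial(n) / (2 * log(2) ** (n + 1))


def sire_count_bound(n):
    """The count 4^n * a(n) used in the proof (four forms per symbol)."""
    return 4 ** n * ordered_bell(n)


for n in range(1, 6):
    print(n, ordered_bell(n), round(ordered_bell_approx(n), 2), sire_count_bound(n))
```

For n = 3 the recurrence gives a(3) = 13, in agreement with the thirteen partitions listed in the proof.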

We can then prove the existence of minimal regular expressions for SIRE. 

Proposition 2. Let $\Sigma$ be a finite alphabet. For every language
$L \subseteq \Sigma^*$, there exists a SIRE-minimal SIRE $\delta_S$.

Proof. Assume there is a language L over $\Sigma$ such that no expression
$\alpha \in$ SIRE is SIRE-minimal. This implies that there is an infinite
sequence $(\beta_i)_{i \ge 0}$ of expressions from SIRE with
$\alpha = \beta_0$ and $L(\beta_i) \supset L(\beta_{i+1}) \supseteq L$ for all
$i \ge 0$. This contradicts the fact that there are only a finite number of
non-equivalent SIREs over $\Sigma$ by Proposition 1. □

Proposition 3. For any example string set E over $\{a_1, \ldots, a_n\}$, let
$S = s_1 \& \cdots \& s_l$ be a SIRE such that $E \subseteq L(S)$. S is a
minimal SIRE if and only if:

(1) the number of $s_i$ is minimized and

(2) the size of each $s_i$ is as large as possible.

The proof is omitted for space reasons.

In other words, a minimal SIRE is the most specific SIRE that is consistent
with the given example strings. For instance, all of S_1 = a&bc&d, S_2 = abc&d
and S_3 = ad&bc accept E = {abcd, adbc}. However, since
S_1 = (ad|da)&bc = (ad&bc)|(da&bc) = S_3|(da&bc), we get
$L(S_1) \supset L(S_3)$, which means S_1 is not minimal. As for S_2 and S_3,
since L(S_2) = {abcd, abdc, adbc, dabc} and
L(S_3) = {bcad, bacd, badc, abcd, abdc, adbc}, this means S_3 is not minimal.
As we shall see, S_2 is a better approximation of E. In fact, S_2 can be
verified to be minimal by referring to Proposition 3.
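
The languages in this example can be enumerated directly from the recursive definition of interleaving in Section 2. The sketch below is restricted to operator-free SIREs written as plain strings such as "abc&d"; the function names `interleave` and `language` are illustrative assumptions.

```python
from functools import reduce


def interleave(u, v):
    """All interleavings of strings u and v: u & v = a(u' & v) U b(u & v')."""
    if not u:
        return {v}
    if not v:
        return {u}
    return ({u[0] + w for w in interleave(u[1:], v)}
            | {v[0] + w for w in interleave(u, v[1:])})


def language(sire):
    """Language of an operator-free SIRE such as 'abc&d'."""
    parts = sire.split("&")
    return reduce(
        lambda lang, part: {w for x in lang for w in interleave(x, part)},
        parts,
        {""},
    )


E = {"abcd", "adbc"}
lang1, lang2, lang3 = language("a&bc&d"), language("abc&d"), language("ad&bc")
print(sorted(lang2))
print(sorted(lang3))
print(lang3 < lang1)  # strict inclusion L(S_3) in L(S_1)
```

All three languages contain E, L(S_2) has four strings and L(S_3) has six, and L(S_3) is strictly contained in L(S_1), as stated above.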


4 Minimal SIREs 

In this section, we first prove that finding a minimal SIRE for a given set of
strings is NP-hard by a reduction from finding a maximum independent set of a
graph, which is a well-known NP-hard graph problem [7]. Then we present
learning algorithms that construct approximately minimal SIREs.

4.1 Exact Identification 

First, we introduce the notion of a maximum independent set of a graph [7].
Given an undirected graph G(V, E), an independent set (IS) is a set
$I \subseteq V$ such that $\forall u, v \in I$, $(u, v) \notin E$. The maximum
independent set (MIS) problem consists in computing an IS of the largest size.
Next, we define the problem all_mis, which takes a graph G as input, finds a
MIS S' of G by applying the function max_independent_set, and repeats this
step on the subgraph G[V − S'] until no vertex remains in the subgraph. In
other words, all_mis divides V into disjoint subsets by max_independent_set.
Clearly, problem all_mis is NP-hard.
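
The all_mis procedure can be sketched as below. This is only an illustration under stated assumptions: `max_independent_set` is implemented here by exhaustive search, which takes exponential time and is suitable for very small graphs only, and ties between independent sets of equal size are broken by the sorted order of the vertices.

```python
from itertools import combinations


def max_independent_set(nodes, edges):
    """Exhaustive search for a largest independent set (small graphs only).

    `edges` is a set of frozensets {u, v}.
    """
    nodes = sorted(nodes)
    for size in range(len(nodes), 0, -1):
        for cand in combinations(nodes, size):
            if not any(frozenset(pair) in edges for pair in combinations(cand, 2)):
                return set(cand)
    return set()


def all_mis(nodes, edges):
    """Repeatedly extract a MIS and continue on the remaining subgraph."""
    edges = {frozenset(e) for e in edges}
    remaining, parts = set(nodes), []
    while remaining:
        mis = max_independent_set(remaining, edges)
        parts.append(mis)
        remaining -= mis
    return parts


print(all_mis("bcd", [("b", "d"), ("c", "d")]))
```

On the graph with vertices b, c, d and edges b-d and c-d, the first MIS is {b, c} and the second is {d}.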

The main idea of finding a minimal SIRE is based on the observation that
there are sets of conflicting siblings that cannot be placed in the same subset
of a CPOS. A pair xy is called a forbid pair in a string database if both xy
and yx exist in the transitive closure of the strings. The set of forbid pairs
is called a constraint. By Proposition 3, if we split the set of symbols in a
constraint into several subsets $t_1, \ldots, t_n$ such that n is minimized
and, for each $i \in [1..n]$, $t_i$ is the longest of its alternatives, then the
set of $t_i$ with $i \in [1..n]$ is a minimal CPOS, which can be transformed
into a minimal SIRE.
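
A minimal sketch of how forbid pairs can be separated from consistent pairs is given below. The function name `split_pairs` is an assumption, pairs of identical symbols are ignored, and the set of consistent pairs is called `l2` to match the name L2 used in Section 4.2.

```python
from itertools import combinations


def split_pairs(words):
    """Return (l2, constraint): pairs seen in one direction only,
    and forbid pairs seen in both directions."""
    closure = set()
    for w in words:
        closure |= {(a, b) for a, b in combinations(w, 2) if a != b}
    constraint = {a + b for a, b in closure if (b, a) in closure}
    l2 = {a + b for a, b in closure if (b, a) not in closure}
    return l2, constraint


l2, constraint = split_pairs(["abcd", "aadbc", "bdd"])
print(sorted(l2))
print(sorted(constraint))
```

For the sample {abcd, aadbc, bdd} this yields the consistent pairs {ab, ac, ad, bc} and the constraint {bd, cd, db, dc}.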

Lemma 1. The minimal SIRE finding problem is NP-hard.

Proof. We demonstrate that all_mis can be reduced in polynomial time to
the minimal SIRE finding problem. Given an instance of all_mis, we can generate
a corresponding instance of minimal SIRE finding as follows. For the graph G
in all_mis, the reduction algorithm computes the constraint set by adding
all edges in G to constraint, which is easily done in polynomial time. The
output of the reduction algorithm is the instance set constraint of the minimal
SIRE finding problem. $t_i$ in the CPOS is the longest of its alternatives if
and only if all_mis computes a maximum independent set at the i-th step. Thus,
the minimal SIRE finding problem is equivalent to the original all_mis. Since
all_mis is NP-hard, the minimal SIRE finding problem is NP-hard. □

4.2 Approximation Algorithm 

The process of this approach is formalized in Algorithm 1. Algorithm 1 works in
four steps, and we illustrate them on the sample E = {abcd, aadbc, bdd}. The
first step (lines 1-2) computes the non-constraint and constraint sets using the
function tran_reduction. The transitive closure of E is
tr = {ab, ac, ad, bc, bd, cd, db, dc}. We add uv to constraint if vu ∈ tr, and
add uv to L2 otherwise. We get L2 = {ab, ac, ad, bc} and
constraint = {bd, cd, db, dc}. We then construct an undirected graph G using
the elements of constraint as edges. The second step (lines 3-7) selects
a MIS of G, adds it to the list allmis and deletes the MIS and its related
edges from G. The process is repeated until there are no nodes left in G.
Finding a maximum independent set is an NP-hard optimization problem. As
such, it is unlikely that there exists an efficient algorithm for finding a
maximum independent set of a graph. However, we can find an approximate MIS
in polynomial time with an approximation algorithm, e.g. the clique_removal
algorithm proposed in [19], which finds an approximation of the maximum
independent set with performance guarantee $O(n/(\log n)^2)$ by excluding
subgraphs. For graph G, we obtain allmis = {{b, c}, {d}}. Next, we add the
non-constraint symbols to the first MIS. Then we have
allmis = {{a, b, c}, {d}}. The third step (lines 8-10) computes a topological
sort for each subgraph induced by a subset of L2 and adds the result to T. For
the sample, it returns T = {abc, d}. Finally, the algorithm returns the
SIRE, whose counting operators 1, *, +, ? can be inferred using the
technique of algorithm CRX [4]. For the sample, it returns a*bc?&d+.


Algorithm 1 conMiner(W)

Input: Set of words W = {w_1, ..., w_n}
Output: a minimal SIRE T
 1: L2, constraint = tran_reduction(W, T)
 2: G = Graph(constraint)
 3: while G.nodes() != null do
 4:     v = clique_removal(G)
 5:     G = G - v
 6:     allmis.append(v)
 7: allmis[0] = allmis[0].union(alphabet(L2) - alphabet(constraint))
 8: for each mis ∈ allmis do
 9:     H = Graph(mis, L2)
10:     T.append(topological_sort(H))
11: return learner_oper(W, T)
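
The third and fourth steps of Algorithm 1 can be sketched in Python as follows, taking the groups in allmis and the set L2 as given. Two assumptions are made: the topological sort uses `graphlib.TopologicalSorter` from the standard library (Python 3.9 or later), and `operator` is a simplified stand-in for the CRX technique [4] that only looks at the minimum and maximum number of occurrences of a symbol per word. The function names are hypothetical.

```python
from graphlib import TopologicalSorter


def order_group(group, l2):
    """Topologically sort the symbols of one group using the pairs in L2."""
    preds = {x: set() for x in group}
    for a, b in l2:  # each element of l2 is a two-symbol string
        if a in group and b in group:
            preds[b].add(a)
    return list(TopologicalSorter(preds).static_order())


def operator(symbol, words):
    """Pick '', '?', '+' or '*' from per-word occurrence counts (simplified)."""
    counts = [w.count(symbol) for w in words]
    lo, hi = min(counts), max(counts)
    if lo == 0:
        return "?" if hi == 1 else "*"
    return "" if hi == 1 else "+"


def to_sire(groups, l2, words):
    """Join the ordered, operator-decorated groups with &."""
    return "&".join(
        "".join(x + operator(x, words) for x in order_group(g, l2))
        for g in groups
    )


words = ["abcd", "aadbc", "bdd"]
print(to_sire([{"a", "b", "c"}, {"d"}], {"ab", "ac", "ad", "bc"}, words))
```

On the sample of this section the sketch reproduces the expression a*bc?&d+.
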
4.3 Heuristic Algorithm

Although a number of approximation algorithms and heuristic algorithms have
been developed for the maximum independent set problem, on any given
instance they may produce a SIRE that is very far from optimal. We introduce a
heuristic directed acyclic graph construction algorithm that directly computes
a minimal SIRE. The main idea is to cluster the vertices of the existing
directed graph into several disconnected subgraphs. The graph is constructed
incrementally, using a greedy approach, to preserve the CPOS within each
vertex. The pseudocode of algorithm conDAG is given in Algorithm 2.

The input to this algorithm is the same as the input of conMiner. The
algorithm maintains lists p, q as records to keep track of pairs violating the
partial order constraint, and lists s, t to record pairs violating the partial
order constraint of the string currently being read. Note that (a, b) violating
the partial order constraint means that there exist some $w_1, w_2 \in W$ such
that $a \prec b$ in $w_1$ and $b \prec a$ in $w_2$.

Let ab be two adjacent symbols in a word w. The add_or_break function
checks whether edge ab is added to the present graph G. If there exists no path
from b to a, no path from a to b in G, and edge ab will not make a connection
between some p[i] and q[i], we add edge a → b to G. Self-loops such as f → f
are always ignored since they have no influence on the partial order
constraints. However, if there exist paths from b to a in G,
(a, b) ≠ (p[i], q[i]), (q[i], p[i]) and a, b are not in p[i], q[i] at the same
time for all i < len(p), we should break all paths from b to a. The breakpoint
can be found as follows. Suppose there exists a path $u = b a_1 \ldots a$ with
$a_0 = b$ in G, and the substring of w over $\{b, a_1, \ldots, a\}$ is
$a_i \ldots a$. Then we delete edge $a_{i-1} \to a_i$, add edge
$\beta \to a_i$ for all nodes $\beta$ with $\beta \to b$, and add edge
$a_{i-1} \to \gamma$ for all nodes $\gamma$ with $a \to \gamma$. In the end,
we add $b \ldots a_{i-1}$ to p, s and add $a_i \ldots a$ to q, t.


Figure 1: An example of finding the breakpoint


The example in Figure 1 shows how the function works. Let W = {βabcdγ, cda},
and initialize empty lists p, q, s, t and an empty graph G. After reading
$w_1$, the lists p, q, s, t are still empty. When reading da ∈ $w_2$, there
already exists a path abcd and (d, a) ≠ (p[i], q[i]), (q[i], p[i]). We should
break abcd. Since substring($w_2$, {a, b, c, d}) = cda, the breakpoint is c.
Then we delete edge b → c, and add edges β → c and b → γ. In the end, we add
ab to p, s and add cd to q, t.

 1: function consistent(G, w, p, q)
 2:     s, t := ∅, i := 1
 3:     while i < |w| − 1 do
 4:         if w[i] ≠ w[i+1] ∧ (w[i], w[i+1]) ∉ (p, q), (q, p) then
 5:             add_or_break(G, w, w[i], w[i+1], p, q, s, t)
 6:         for j := 1 to |s| do
 7:             if (w[i] ∈ s[j]) ∧ ((t[j][−1], w[i+1]) ∉ (p, q)) then
 8:                 add_or_break(G, w, lastsymbol(t[j]), w[i+1], p, q, s, t)
 9:             if (w[i] ∈ t[j]) ∧ ((s[j][−1], w[i+1]) ∉ (p, q)) then
10:                 add_or_break(G, w, lastsymbol(s[j]), w[i+1], p, q, s, t)
11:         i++


The consistent function scans the whole string w sequentially and executes
the add_or_break function. Each time after reading two adjacent symbols ab,
for all pairs ($a_1 a a_2$, $a_3 c$) or ($a_3 c$, $a_1 a a_2$) ∈ (s, t), cb is
handled likewise. This is because ($a_1 a a_2$, $a_3 c$) or
($a_3 c$, $a_1 a a_2$) ∈ (s, t) declares that both a ≺ c and c ≺ a occur in w,
so if a ≺ b in w, then c ≺ b is also in w. Consider acab as an example. After
reading ca, c and a have been separated into two parts: a has been added to p
and s, and c has been added to q and t. After reading the next two symbols ab,
we add edge a → b. Next we should consider cb, since a ∈ s[0] and c ∈ t[0],
and thus add edge c → b. The function topological_sort(G) constructs a
topological ordering of the DAG in linear time. The function learner_oper is
used to infer the operators ?, +, * for each vertex.

The conDAG algorithm combines all these functions. The constructed graph is
denoted by G and the corresponding set of partitions by C. In the i-th
iteration, it invokes consistent to update G using the i-th string. Then it
adds to C all the paths from the set of vertices of in-degree zero to the set
of vertices of out-degree zero. To be able to calculate the largest independent
partial-order plans, a preprocessing phase is implemented. First, we consider
the elements of C in decreasing order of size. In each iteration, whenever we
find two elements such that one contains elements of p[i] and the other
contains elements of q[i], we update the shorter one by removing the common
elements. Next, we merge all the lists in C that share common elements. The
preprocessing terminates when every symbol is included in one and only one
list. The following steps of the algorithm are the same as the third and the
fourth steps of conMiner.

The time complexity analysis of this algorithm is straightforward.
add_or_break(G, w, a, b, p, q, s, t) can find all possible paths between two
given nodes by a modified DFS, which needs $O(|V| + |E|)$ steps. Breaking a
cycle requires $O(|V|)$. Therefore, the overall time complexity of
add_or_break is $O(c|V| + |E|)$, where c is the number of paths between the
given nodes in the graph. The worst case occurs when there exist $n(n-1)/2$
inconsistent terms in W, so that no two symbols are in the same group. When
handling $a_{i-1}a_i$, we have $len(p) = (n-i+1)(n-i)/2$, so deciding whether
$(a_{i-1}, a_i) \in (p[j], q[j]), (q[j], p[j])$ needs $(n-i+1)(n-i)$ time.
Deciding whether $a_i \in s[j], t[j]$ needs $n-i$ time. There is only one path
between two nodes, thus c = 1. So the total time of consistent is
$\sum_{i=2}^{n} (n-i)^2 (|V| + |E|)$, where $|V| = n$ and $|E|$ is $O(n)$
according to the analysis above.

Algorithm 2 conDAG(W)

Input: Set of unordered words W = {w_1, ..., w_n}
Output: a minimal SIRE
 1: L2, constraint = tran_reduction(W, T)
 2: initialize graph G, p, q := ∅
 3: for i = 1 to n do
 4:     consistent(G, w_i, p, q)
 5: C = all_paths(G, source, destination)
 6: remove the common elements from the shorter of c_i, c_j ∈ C if c_i[m] + c_j[n] ∈ constraint
 7: merge all lists that share common elements in C
 8: for each mis in C do
 9:     H = Graph(mis, L2)
10:     T.append(topological_sort(H))
11: return learner_oper(W, T)


The tran_reduction computation requires $O(n^2)$ time, where n is the number
of distinct symbols. Each iteration requires $O(n^3)$ time to maintain the
graph. Computing all paths from source to destination can be done in $O(n^2)$
time, and topological_sort constructs a topological ordering of the DAG in
linear time, thus $O(|V| + |E|)$ steps are sufficient. Inference of the
operators ?, +, * needs time $O(m)$. Hence the time complexity of the
algorithm is $O(tn^4 + m)$, where m is the sum of the lengths of the input
example strings, n is the number of alphabet symbols and t is the number of
strings.

To illustrate our algorithm, consider the example E = {abcd, aadbc, bdd} with
L2 = {ab, ac, ad, bc} and constraint = {bd, cd, db, dc} from the above section.
A directed graph which consists of vertices V = {a, b, c, d} and edges
E = {ab, bc, ad} can be obtained, with p = {bc} and q = {d}. All paths from
source to destination are C = {abc, ad}. Since bd ∈ constraint, C[2] is updated
by removing the common elements between C[1] and C[2], so C[2] becomes d. The
final C is {abc, d}. The following steps are the same.
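
Lines 5 and 6 of Algorithm 2 can be sketched on this example as follows. This is a simplified illustration under stated assumptions: the DAG is given directly as a set of two-symbol edges rather than built by consistent, the helper names `all_paths` and `prune` are hypothetical, and the merging step of line 7 is left out because nothing needs to be merged here.

```python
def all_paths(edges):
    """All paths from vertices of in-degree zero to vertices of out-degree zero."""
    succ, nodes = {}, set()
    for a, b in edges:  # each edge is a two-symbol string such as "ab"
        succ.setdefault(a, []).append(b)
        nodes |= {a, b}
    sources = sorted(nodes - {b for _, b in edges})
    paths = []

    def walk(node, path):
        if node not in succ:  # out-degree zero: the path is complete
            paths.append(path)
            return
        for nxt in sorted(succ[node]):
            walk(nxt, path + nxt)

    for src in sources:
        walk(src, src)
    return paths


def prune(paths, constraint):
    """Remove from the shorter path the symbols it shares with a longer,
    conflicting path (a pair of symbols across the two paths is a forbid pair)."""
    paths = sorted(paths, key=len, reverse=True)
    for i, long_path in enumerate(paths):
        for j in range(i + 1, len(paths)):
            short = paths[j]
            if any(x + y in constraint for x in long_path for y in short):
                paths[j] = "".join(ch for ch in short if ch not in long_path)
    return [p for p in paths if p]


C = all_paths({"ab", "bc", "ad"})
print(C)
print(prune(C, {"bd", "cd", "db", "dc"}))
```

The two paths abc and ad conflict through the forbid pair bd, so the shorter path loses the shared symbol a and the result is {abc, d}.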

5 Experiments 

In this section, we validate our approaches on real-life DTDs, and compare them
with Trang [8]. All experiments were conducted on an IBM T400 laptop
computer with an Intel Core 2 Duo CPU (2.4 GHz) and 2 GB of memory. All code
was written in Python.

The number of corpora of XML documents with an interesting schema is
rather limited. We obtained our real-life DTDs from the XML Data Repository
maintained by Miklau [18]. Unfortunately, most of the corpora are either not
data-centric or come without a DTD. Specifically, we chose the DBLP Computer
Science Bibliography corpus, a data-centric database of information on major
computer science journals and proceedings.

Element: inproceedings (sample size 2122274)
  Original DTD         (a1|a2|...|a22)*
  Exact minimal DTD    a1*a12?a5*a9?a18?a15*&a3a6a11*&a19*&a13*&a4&a14*
  Result of conMiner   a5*a18?a15*&a12?a9?a13*&a1*a14*&a6a11*&a3&a4&a19*
  Result of conDAG     a1*a4a9?a11*a15*&a3a12?a5*a18?&a13*&a6&a14*&a19*
  Result of Trang      (a1|a3|a5|a6|a9|a11|a12|a13|a14|a15|a18|a19)+
  Simplified form                 Interleavings   Sizes
    Exact minimal DTD             5               6&3&1&1&1&1
    Result of conMiner            6               3&3&2&2&1&1&1
    Result of conDAG              5               5&4&1&1&1&1

Element: article (sample size 111608)
  Original DTD         (a1|a2|...|a22)*
  Exact minimal DTD    a1*a17?a5a12?a15*&a3a6a11?&a13*&a8&a10?&a14*&a9?
  Result of conMiner   a17?a12?a9?a15*&a1*a6a11?&a3&a5&a13*&a8&a10?&a14*
  Result of conDAG     a3*a17?a6a11?&a1*a8a12?a15*&a13*&a5&a10?&a12?&a9?
  Result of Trang      a2?(a1|a3|a5|a6|a8|a9|a10|a11|a12|a13|a14|a15|a17)+
  Simplified form                 Interleavings   Sizes
    Exact minimal DTD             6               5&3&1&1&1&1&1
    Result of conMiner            7               4&3&1&1&1&1&1&1
    Result of conDAG              6               4&4&1&1&1&1&1

Element: proceedings (sample size 3007)
  Original DTD         (a1|a2|...|a22)*
  Exact minimal DTD    a2*a3+a18?a21?a8?a10?a13?a12?a15*a19?a7?a9?&a4?&a17?&a6&a20&a11?
  Result of conMiner   a2*a3+a19?a13?a20a15*a12?&a4?a7?a8?a9?&a21?a18?a10?&a6&a17?&a11?
  Result of conDAG     a2*a3+a8?a18?a21?a10?a9?a19?a13?a7?a15*&a4?a12?&a17?&a6&a20&a11?
  Result of Trang      a2*a3+(a4|a6|a7|a8|a9|a10|a11|a12|a13|a17|a18|a19|a20|a21)+a15*
  Simplified form                 Interleavings   Sizes
    Exact minimal DTD             5               12&1&1&1&1&1
    Result of conMiner            5               7&4&3&1&1&1
    Result of conDAG              5               11&2&1&1&1&1

Element: incollection (sample size 1009)
  Original DTD         (a1|a2|...|a22)*
  Exact minimal DTD    a1*a3a4a17?a20?a16?a11?a15*a14?&a13?a19?&a5?&a6
  Result of conMiner   a1*a3a17?a6&a15*a13?a16?a14?&a4a11?&a20?a19?&a5?
  Result of conDAG     a1*a3a4a17?a11?a15*a14?&a6a20?&a5?a16?&a13?&a19?
  Result of Trang      (a1|a3|a4|a5|a6|a11|a13|a16|a17|a20)+(a14|a15*)
  Simplified form                 Interleavings   Sizes
    Exact minimal DTD             3               9&2&1&1
    Result of conMiner            4               4&4&2&2&1
    Result of conDAG              4               7&2&2&1&1

Element: phdthesis (sample size 72)
  Original DTD         (a1|a2|...|a22)*
  Exact minimal DTD    a1a3a6a17?a21?a20?a9?a13?a12?&a22
  Result of conMiner   a1a3a6a12?a21?a22a13?a20?&a17?a9?
  Result of conDAG     a1a3a6a17?a21?a20?a13?a9?a12?&a22
  Result of Trang      a1a3a6(a12|a21)?(a9|a17|a22)+(a13|a20)?
  Simplified form                 Interleavings   Sizes
    Exact minimal DTD             1               9&1
    Result of conMiner            1               8&2
    Result of conDAG              1               9&1

Element: www (sample size 38)
  Original DTD         (a1|a2|...|a22)*
  Exact minimal DTD    a1*a2*a3a4?a6?a11
  Result of conMiner   a1*a2*a3a4?a6?a11
  Result of conDAG     a1*a2*a3a4?a6?a11
  Result of Trang      (a1*|a2*)a3a4?a6?a11
  Simplified form                 Interleavings   Sizes
    Exact minimal DTD             0               6
    Result of conMiner            0               6
    Result of conDAG              0               6

Table 1: Results of exact algorithm, conMiner, conDAG and Trang on DTDs

Table 1 lists the non-trivial element definitions in the above-mentioned DTD
together with the results derived by the exact algorithm, the approximation
algorithm conMiner, the heuristic algorithm conDAG, and Trang. We implement
the exact algorithm using conMiner by replacing the function clique_removal
with an exponential-time algorithm proposed by S. Tsukiyama [20]. We also list
the number of interleavings used and the simplified forms of our results to
give a clear view of their relationship. For each element, the table gives the
element name and the sample size, followed by the number of interleavings used
by the results of the exact algorithm, conMiner and conDAG, respectively.
It can be verified that all expressions learned by the exact algorithm, conDAG
and conMiner are stricter than those of Trang and the original DTDs, which
indicates that both the original DTDs and the results of Trang are
considerably over-permissive.

We note that there may exist many minimal expressions for a given set of
unordered strings. For instance, for phdthesis, the form of the result of
conDAG is the same as that of the exact minimal expression. The orders among
the symbols of their first siblings, however, differ widely. This is due to the
fact that a digraph may have several different topological sorts. Therefore, we
ignore the order of the symbols and only compare the simplified forms. The
table shows clearly that conDAG yields concise super-approximations of the
exact minimal expressions. Although for proceedings, incollection and
phdthesis the expressions produced by conMiner and conDAG have the same number
of interleavings, conDAG yields longer siblings and thus finds solutions of
better quality than the solutions found by the approximation algorithm.
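
The simplified form used for this comparison can be derived mechanically from a learned expression. The sketch below assumes that symbols are written as a followed by an index (a1, a12, ...) and that the expression contains no disjunction, as is the case for the SIREs in Table 1; the function name `simplify` is hypothetical.

```python
import re


def simplify(sire):
    """Return (number of interleavings, sizes of the siblings in decreasing order)."""
    sizes = sorted(
        (len(re.findall(r"a\d+", part)) for part in sire.split("&")),
        reverse=True,
    )
    return len(sizes) - 1, "&".join(map(str, sizes))


# Exact minimal expression for inproceedings, as listed in Table 1.
print(simplify("a1*a12?a5*a9?a18?a15*&a3a6a11*&a19*&a13*&a4&a14*"))
```

Comparing only these sizes makes two expressions with different topological orders of the same siblings indistinguishable, which is the intended effect.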

6 Conclusion 

This paper proposes a strategy for learning a class of regular expressions with
interleaving: first, compute a consistent partial order set T, then equip each
factor with counting operators. As future work, we will investigate several
interesting problems inspired by this study. First, we would like to extend our
algorithms to more expressive schemas, for example schemas that allow
disjunction "|" within siblings. Second, how to extend the algorithms to mine
all independent frequent closed partial orders [17] is also an attractive topic.